Template-Independent Web Object Extraction
Abstract
Many kinds of objects are embedded in static Web pages and online Web databases. Extracting and integrating these objects from the Web is of great significance for Web data management. Existing Web information extraction (Web IE) techniques cannot provide a satisfactory solution to the Web object extraction task, since objects of the same type are distributed across diverse Web sources whose structures are highly heterogeneous. Classic information extraction (IE) methods, which are designed for processing plain-text documents, also fail to meet the requirements of this task. In this paper, we propose a novel approach called Object-Level Information Extraction (OLIE) to extract Web objects. This approach extends a classic IE algorithm, Conditional Random Fields (CRF), by adding Web-specific information. It is essentially a combination of Web IE and classic IE. Specifically, visual information on the Web pages is used to select appropriate atomic elements for extraction and to distinguish attributes, and structured information from external Web databases is applied to assist the extraction process. The experimental results show that OLIE can significantly improve Web object extraction accuracy.
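The combination described in the abstract can be illustrated with a minimal sketch of per-element feature construction, of the kind a linear-chain CRF could consume. This is not the OLIE implementation: the `Element` fields, the feature names, and the `known_values` lookup (standing in for structured information from an external Web database) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical atomic page element with its rendered visual properties."""
    text: str
    font_size: float
    bold: bool
    x: int
    y: int


def element_features(elem, prev, known_values):
    """Build a feature dict for one element in a sequence.

    known_values maps an attribute name to a set of lowercase strings
    assumed to come from an external database (illustrative structure).
    """
    text = elem.text.strip()
    feats = {
        # Plain-text cues, as in classic IE.
        "n_tokens": len(text.split()),
        "has_digit": any(ch.isdigit() for ch in text),
        # Visual cues taken from the rendered page.
        "font_size": elem.font_size,
        "bold": elem.bold,
    }
    if prev is not None:
        # Visual contrast with the previous element can hint at an attribute boundary.
        feats["same_row_as_prev"] = elem.y == prev.y
        feats["font_change"] = elem.font_size != prev.font_size
    # Database cues: does the text match a known value of some attribute?
    for attr, values in known_values.items():
        feats["db_match_" + attr] = text.lower() in values
    return feats


def sequence_features(elements, known_values):
    """Feature dicts for a whole sequence of elements, in page order."""
    return [
        element_features(e, elements[i - 1] if i > 0 else None, known_values)
        for i, e in enumerate(elements)
    ]
```

A sequence labeler would take the list returned by `sequence_features` as its observations and assign one attribute label per element; the choice of labeler and its training procedure are outside the scope of this sketch.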
Similar resources
Site-Independent Template-Block Detection
Detection of template and noise blocks in web pages is an important step in improving the performance of information retrieval and content extraction. Of the many approaches proposed, most rely on the assumption of operating within the confines of a single website or require expensive hand-labeling of relevant and non-relevant blocks for model induction. This reduces their applicability, since ...
Inférer des Objets Sémantiques du Web Structuré
This thesis focuses on the extraction and analysis of Web data objects, investigated from different points of view: temporal, structural, semantic. We first survey different strategies and best practices for deriving temporal aspects of Web pages, together with a more in-depth study on Web feeds for this particular purpose, and other statistics. Next, in the context of dynamically-generated Web...
Web Template Extraction Based on Hyperlink Analysis
Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugging content into already formatted and prepared pagelets. For the final user, templates are also useful because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important ...
Effective and Enhanced method for Template Extraction from Heterogeneous Web Pages
To achieve high publishing productivity, web pages are automatically evaluated using common templates with contents. The templates give readers easy access to the contents, guided by consistent structures. The web documents are clustered based on the similarity of their underlying template structures, so that the template for each cluster is extracted simultaneously. This process propo...
Site-Level Web Template Extraction Based on DOM Analysis
One of the main development resources for website engineers is Web templates. Templates allow them to increase productivity by plugging content into already formatted and prepared pagelets. For the final user, templates are also useful because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important ...